Project

In a study on pollution, carried out in 41 cities in the USA, the following variables were considered:

so2 - Sulfur dioxide content of air in micrograms per cubic meter

temp -Average annual temperature (F)

manuf - No. of manufacturing enterprises employing 20 or more workers

pop - Population size (1970 census) in thousands

wind - Average wind speed in miles per hour

precip - Average annual precipitation in inches

days - Average number of days with precipitation per year (annual average number of days of rain)

Uploading the Data

UPLOADING THE DATA

#RUN DATA
data5 <- read.csv("data_5.csv", header = TRUE)
head(data5)
##       city so2 temp manuf pop wind precip days
## 1  Phoenix  10 70.3   213 582  6.0   7.05   36
## 2 Little R  13 61.0    91 132  8.2  48.52  100
## 3 San Fran  12 56.7   453 716  8.7  20.66   67
## 4   Denver  17 51.9   454 515  9.0  12.95   86
## 5 Hartford  56 49.1   412 158  9.0  43.37  127
## 6 Wilmingt  36 54.0    80  80  9.0  40.25  114

CHOOSING THE VARIABLES

All but the city since it’s not numeric.

dim(data5)
## [1] 41  8
data5_variables <- data5[,2:8]
head(data5_variables)
##   so2 temp manuf pop wind precip days
## 1  10 70.3   213 582  6.0   7.05   36
## 2  13 61.0    91 132  8.2  48.52  100
## 3  12 56.7   453 716  8.7  20.66   67
## 4  17 51.9   454 515  9.0  12.95   86
## 5  56 49.1   412 158  9.0  43.37  127
## 6  36 54.0    80  80  9.0  40.25  114

1) Preliminary Analysis of the Data

summary(data5_variables)
##       so2              temp           manuf             pop        
##  Min.   :  8.00   Min.   :43.50   Min.   :  35.0   Min.   :  71.0  
##  1st Qu.: 13.00   1st Qu.:50.60   1st Qu.: 181.0   1st Qu.: 299.0  
##  Median : 26.00   Median :54.60   Median : 347.0   Median : 515.0  
##  Mean   : 30.05   Mean   :55.76   Mean   : 463.1   Mean   : 608.6  
##  3rd Qu.: 35.00   3rd Qu.:59.30   3rd Qu.: 462.0   3rd Qu.: 717.0  
##  Max.   :110.00   Max.   :75.50   Max.   :3344.0   Max.   :3369.0  
##       wind            precip           days      
##  Min.   : 6.000   Min.   : 7.05   Min.   : 36.0  
##  1st Qu.: 8.700   1st Qu.:30.96   1st Qu.:103.0  
##  Median : 9.300   Median :38.74   Median :115.0  
##  Mean   : 9.444   Mean   :36.77   Mean   :113.9  
##  3rd Qu.:10.600   3rd Qu.:43.11   3rd Qu.:128.0  
##  Max.   :12.700   Max.   :59.80   Max.   :166.0

PreProcessing of the Data

Since the measures of each column are very different, we have to use a correlation matrix to start the PCA.

CORRELATION MATRIX

## 1st) Determine the correlation matrix

cor_data5 <- cor(data5_variables)
cor_data5
##                so2        temp       manuf         pop        wind      precip
## so2     1.00000000 -0.43360020  0.64476873  0.49377958  0.09469045  0.05429434
## temp   -0.43360020  1.00000000 -0.19004216 -0.06267813 -0.34973963  0.38625342
## manuf   0.64476873 -0.19004216  1.00000000  0.95526935  0.23794683 -0.03241688
## pop     0.49377958 -0.06267813  0.95526935  1.00000000  0.21264375 -0.02611873
## wind    0.09469045 -0.34973963  0.23794683  0.21264375  1.00000000 -0.01299438
## precip  0.05429434  0.38625342 -0.03241688 -0.02611873 -0.01299438  1.00000000
## days    0.36956363 -0.43024212  0.13182930  0.04208319  0.16410559  0.49609671
##               days
## so2     0.36956363
## temp   -0.43024212
## manuf   0.13182930
## pop     0.04208319
## wind    0.16410559
## precip  0.49609671
## days    1.00000000

OBTAIN THE EIGENVALUES AND EIGENVECTORS

## 2nd) Obtain eigenvectors and eigenvalues

eigen_data5 <- eigen(cor_data5)
eigen_data5
## eigen() decomposition
## $values
## [1] 2.72811968 1.51233485 1.39497299 0.89199129 0.34677866 0.10028759 0.02551493
## 
## $vectors
##               [,1]        [,2]       [,3]        [,4]       [,5]        [,6]
## [1,]  0.4896988171 -0.08457563 -0.0143502  0.40421007  0.7303942 -0.18334573
## [2,] -0.3153706901  0.08863789 -0.6771362 -0.18522794  0.1624652 -0.61066107
## [3,]  0.5411687028  0.22588109 -0.2671591 -0.02627237 -0.1641011  0.04273352
## [4,]  0.4875881115  0.28200380 -0.3448380 -0.11340377 -0.3491048  0.08786327
## [5,]  0.2498749284 -0.05547149  0.3112655 -0.86190131  0.2682549 -0.15005378
## [6,]  0.0001873122 -0.62587937 -0.4920363 -0.18393719  0.1605988  0.55357384
## [7,]  0.2601790729 -0.67796741  0.1095789  0.10976070 -0.4399698 -0.50494668
##              [,7]
## [1,]  0.149529278
## [2,] -0.023664113
## [3,] -0.745180920
## [4,]  0.649125507
## [5,]  0.015765377
## [6,] -0.010315309
## [7,]  0.008217393

CRITERIA 1 => ANALYSING WITH KAISER’S CRITERIA

Analyzing the results, we should retain the first 3 PCs because there are 3 eigenvalues that are >1 (Kaiser’s Criteria).

But now we will perform PCA to see if 3 PCs explain the majority of the

CRITERIA 2 => PCA ANALYSIS

# Perfom PCA

pca_data5 <- princomp(data5_variables, cor = TRUE)

print(summary(pca_data5), loadings = TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.6517021 1.2297702 1.1810897 0.9444529 0.58887916
## Proportion of Variance 0.3897314 0.2160478 0.1992819 0.1274273 0.04953981
## Cumulative Proportion  0.3897314 0.6057792 0.8050611 0.9324884 0.98202821
##                           Comp.6      Comp.7
## Standard deviation     0.3166822 0.159733920
## Proportion of Variance 0.0143268 0.003644989
## Cumulative Proportion  0.9963550 1.000000000
## 
## Loadings:
##        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## so2     0.490                0.404  0.730  0.183  0.150
## temp   -0.315         0.677 -0.185  0.162  0.611       
## manuf   0.541 -0.226  0.267        -0.164        -0.745
## pop     0.488 -0.282  0.345 -0.113 -0.349         0.649
## wind    0.250        -0.311 -0.862  0.268  0.150       
## precip         0.626  0.492 -0.184  0.161 -0.554       
## days    0.260  0.678 -0.110  0.110 -0.440  0.505

Analyzing the results, we have 81% of the variance explained with the first 3 PCs (viewed in the cumulative proportion). However, if we consider 4 PCs, we will have 93% of the variance explained.

Since this is not conclusive, we should do the third criteria.

CRITERIA 3 => SCREE PLOT

## Calculate total variance explained by each PC

var_explained_data5 = pca_data5$sdev^2 / sum(pca_data5$sdev^2)

## Create scree-plot
library(ggplot2)

qplot(c(1:7), var_explained_data5) +
  geom_line() +
  xlab("Principal Component") +
  ylab("Variance Explained") +
  ggtitle("Scree Plot")+
  ylim(0,1)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Analyzing the results, in this plot, the elbow point would be between PC2 and PC3. However, if we consider he other criterias we have applied before, it would be best to use 3 PCs, having 81% of the variance explained.

SO LET’S CONSIDER 3 PCs

Now, we need to see which are the variables that contribute more for the explanation of each principal component retained. We need to have a correlation matrix between the variables and the PCs. (Loadings)

# Correlation between original variables and the PCs (loadins) 

component_matrix <- cor(data5_variables,pca_data5$scores)
component_matrix
##               Comp.1      Comp.2      Comp.3      Comp.4      Comp.5
## so2     0.8088365434  0.10400859  0.01694887  0.38175738  0.43011391
## temp   -0.5208984175 -0.10900423  0.79975860 -0.17493906  0.09567234
## manuf   0.8938494595 -0.27778184  0.31553891 -0.02481301 -0.09663572
## pop     0.8053502866 -0.34679989  0.40728458 -0.10710452 -0.20558056
## wind    0.4127189331  0.06821718 -0.36763244 -0.81402520  0.15796972
## precip  0.0003093839  0.76968782  0.58113903 -0.17372001  0.09457328
## days    0.4297383099  0.83374415 -0.12942257  0.10366381 -0.25908903
##             Comp.6       Comp.7
## so2     0.05806232  0.023884898
## temp    0.19338547 -0.003779961
## manuf  -0.01353294 -0.119030670
## pop    -0.02782473  0.103687362
## wind    0.04751936  0.002518265
## precip -0.17530696 -0.001647705
## days    0.15990761  0.001312596

We only need to look into the the 3 first PCs.

Comp 1 => explain so2, manuf and pop

Comp 2 => explain precip and days

Comp 3 => temp

Wind is only explained well with the 4th PC.

Let’s prove that with:

# Apply the rule for the 1st PC
# lambda1 / p

sqrt(eigen_data5$values[1]/8)
## [1] 0.5839649
# Apply the rule for the 2nd PC
# lambda1 / p

sqrt(eigen_data5$values[2]/8)
## [1] 0.4347894
# Apply the rule for the 3rd PC
# lambda1 / p

sqrt(eigen_data5$values[3]/8)
## [1] 0.4175783

GRAPHICAL REPRESENTATION

If we did a graphic just for the first two PCS:

library(devtools)
## Loading required package: usethis
require(ggbiplot)
## Loading required package: ggbiplot
ggbiplot(pca_data5,labels = row.names(data5_variables))

However, we decided to work with the first 3 PCs. So, if we did a graphic for the first 3 PCs:

library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Supondo que você tenha feito o PCA:
pca_data5_2 <- prcomp(data5_variables, scale. = TRUE)

pca_scores <- as.data.frame(pca_data5_2$x)

var_explained <- (pca_data5$sdev)^2 / sum((pca_data5$sdev)^2)

pc1_label <- paste0("PC1 (", round(var_explained[1] * 100, 1), "%)")
pc2_label <- paste0("PC2 (", round(var_explained[2] * 100, 1), "%)")
pc3_label <- paste0("PC3 (", round(var_explained[3] * 100, 1), "%)")

plot_ly(pca_scores,
        x = ~PC1,
        y = ~PC2,
        z = ~PC3,
        type = 'scatter3d',
        mode = 'markers',
        text = rownames(pca_scores),
        marker = list(size = 4)
)%>%
  layout(
    title = "3D Graphic with the first 3 PCs",
    scene = list(
      xaxis = list(title = pc1_label),
      yaxis = list(title = pc2_label),
      zaxis = list(title = pc3_label)
    )
  )